Efficient Algorithms for Decision Tree Cross-validation (Extended Abstract)

Authors

  • Hendrik Blockeel
  • Jan Struyf
Abstract

Cross-validation is a generally applicable and very useful technique for many tasks often encountered in machine learning, such as accuracy estimation, feature selection or parameter tuning. A common property of these tasks is that one wants to validate a learned theory on a set of examples not used for its construction (i.e., an "independent test set"). When insufficient data are available to reliably train on one subset of the data and validate the results on a disjoint subset, cross-validation provides a solution. It consists of partitioning a data set D into n subsets D_i and then running a given algorithm n times, each time using a different training set D \ D_i and validating the results on D_i. The results on each D_i are averaged to provide a reliable estimate of the induced model's performance on unseen cases.

An often mentioned disadvantage of cross-validation is its computational cost: the learning algorithm needs to be run n times. However, while conceptually this is true, it need not be implemented this way. The purpose of this paper is to show that, in a number of cases, a full cross-validation can be performed with only little overhead over the original induction algorithm.

The main contributions of this work are as follows. We show how to extend classical algorithms for decision tree induction [5, 4] in such a way that a full cross-validation is integrated with the induction process at minimal cost; the key observation is that a cross-validation performs many redundant computations, and by rearranging these computations we can often reuse results instead of recomputing them. We analyse the computational complexity of the novel algorithm, identifying those parameters that influence the overhead most. It turns out that, compared to the standard implementation of cross-validation, our method...
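The reuse idea described above can be made concrete with a small illustration. The sketch below is a minimal illustration under assumptions of our own, not the paper's actual algorithm or data structures; the function name and the NumPy-based layout are invented for this example. It shows how, for a single candidate split, the left-branch class counts for every cross-validation training set (D with D_i removed) can be obtained in one pass over D: counts are accumulated per fold, and each fold's training-set counts follow by subtracting the held-out fold's counts from the shared totals.

```python
import numpy as np

def left_branch_counts_per_fold(x_col, y, fold_ids, threshold, n_folds, n_classes):
    """For one candidate split (x_col <= threshold), return the left-branch
    class counts on every cross-validation training set (D with D_i removed).

    Illustrative sketch only: all names and the NumPy layout are assumptions,
    not the paper's actual data structures.
    """
    left = x_col <= threshold
    # One pass over D: accumulate left-branch class counts separately per fold.
    per_fold = np.zeros((n_folds, n_classes), dtype=np.int64)
    np.add.at(per_fold, (fold_ids[left], y[left]), 1)
    # Shared totals over all of D.
    total = per_fold.sum(axis=0)
    # Training-set counts for fold i follow by subtraction, reusing the
    # shared totals instead of recomputing them n times.
    return total[None, :] - per_fold  # shape: (n_folds, n_classes)
```

From these counts (together with the analogous right-branch counts), a splitting heuristic such as information gain could be evaluated for all n folds without scanning the data n times.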


Similar Articles


Efficient algorithms for decision tree cross-validation

Cross-validation is a useful and generally applicable technique often employed in machine learning, including decision tree induction. An important disadvantage of straightforward implementation of the technique is its computational overhead. In this paper we show that, for decision trees, the computational overhead of cross-validation can be reduced significantly by integrating the cross-valida...
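For contrast, the sketch below shows the straightforward implementation whose overhead the paper addresses: the tree learner is simply run once per fold. It is a minimal example of our own; scikit-learn's DecisionTreeClassifier and the iris data set are stand-ins chosen for illustration, not part of the paper.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Straightforward n-fold cross-validation: the tree learner is run n times,
# once per training set, which is exactly the overhead the paper targets.
X, y = load_iris(return_X_y=True)
scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
    tree = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    scores.append(tree.score(X[test_idx], y[test_idx]))
print(np.mean(scores))  # averaged estimate of performance on unseen cases
```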


Comparison of Performance in Image Classification Algorithms of Satellite in Detection of Sarakhs Sandy zones

Extended abstract. 1. Introduction: Wind erosion as an "environmental threat" has caused serious problems worldwide. Identifying and evaluating areas affected by wind erosion can be an important tool for managers and planners in the sustainable development of different areas. Nowadays there are various methods for zoning lands affected by wind erosion. One of the most important...


Ensemble Classification and Extended Feature Selection for Credit Card Fraud Detection

Due to the rise of technology, the possibility of fraud in areas such as banking has increased. Credit card fraud is a crucial problem in banking and its danger is ever increasing. This paper proposes an advanced data mining method, considering both feature selection and decision cost, for accuracy enhancement of credit card fraud detection. After selecting the best and most effec...


Cross-Validated C4.5: Using Error Estimation for Automatic Parameter Selection

Machine learning algorithms for supervised learning are in wide use. An important issue in the use of these algorithms is how to set the parameters of the algorithm. While the default parameter values may be appropriate for a wide variety of tasks, they are not necessarily optimal for a given task. In this paper, we investigate the use of cross-validation to select parameters for the C4.5 decis...
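As a rough illustration of this kind of cross-validated parameter selection (not the procedure of the paper itself), the sketch below tunes two decision-tree parameters by 10-fold cross-validated grid search. scikit-learn's DecisionTreeClassifier is used as a stand-in, since C4.5 itself is not available there, and the parameter analogies noted in the comments are approximate.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
param_grid = {
    # Roughly analogous to C4.5's -m option (minimum cases per branch).
    "min_samples_leaf": [1, 2, 5, 10],
    # Cost-complexity pruning strength; only a loose stand-in for C4.5's
    # confidence-based pruning (-c).
    "ccp_alpha": [0.0, 0.01, 0.05],
}
# Cross-validated error estimates drive the parameter choice.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=10)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```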




Publication year: 2007